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ABSTRACT 

We develop a statistical tool SNVer for calling com- 
mon and rare variants in analysis of pooled or indi- 
vidual next-generation sequencing (NGS) data. We 
formulate variant calling as a hypothesis testing 
problem and employ a binomial-binomial model to 
test the significance of observed allele frequency 
against sequencing error. SNVer reports one single 
overall P-value for evaluating the significance of a 
candidate locus being a variant based on which 
multiplicity control can be obtained. This is particu- 
larly desirable because tens of thousands loci are 
simultaneously examined in typical NGS experi- 
ments. Each user can choose the false-positive 
error rate threshold he or she considers appropri- 
ate, instead of just the dichotomous decisions of 
whether to 'accept or reject the candidates' pro- 
vided by most existing methods. We use both 
simulated data and real data to demonstrate the 
superior performance of our program in comparison 
with existing methods. SNVer runs very fast and can 
complete testing 300 K loci within an hour. This ex- 
cellent scalability makes it feasible for analysis of 
whole-exome sequencing data, or even whole- 
genome sequencing data using high performance 
computing cluster. SNVer is freely available at 
http://snver.sourceforge.net/. 

INTRODUCTION 

The past few years have seen a dramatic development in 
sequencing technology, which has made the per-base cost 
of DNA sequencing plummet by ~ 100 000-fold over the 
past decade (1). Because of the affordable cost and high 



digital resolution, the new or 'next-generation' sequencing 
(NGS) technology is replacing the traditional 
hybridization-based microarray technology in many appli- 
cations (2). For genetics studies, NGS holds the promise 
to revolutionize genome-wide association studies 
(GWAS). The recently completed phase of GWAS 
mainly addresses common SNPs with Minor allele fre- 
quency (MAF) >5%, based upon the common disease/ 
common variant (CD/CV) hypothesis (3). However, the 
identified common variants explain only a small propor- 
tion of heritability (4). Rare variants therefore have been 
hypothesized to account for the missing heritability (5,6). 
To identify rare variants, a direct and more powerful 
approach is to sequence a large number of individuals 

(7) . This line of thought also implicitly motivates the 
recent 1000 Genomes Project, which will sequence the 
genomes of 1200 individuals of various ethnicities by NGS 

(8) . It is expected to extend the catalog of known human 
variants down to a frequency 

Although the cost of whole-genome or exome sequenc- 
ing of all enrolled subjects is prohibitively high now, such 
studies will eventually be carried out in a manner similar 
to GWAS with very large sample sizes (9). While the cost 
is being brought down to as low as SI 000 for sequencing a 
whole genome (10), in the interim, a cost-effective strategy 
has to be taken in order to take the full advantage of 
NGS. Such issues with cost and labor are not new as 
similar problems were confronted in the early expensive 
stage of GWAS and were circumvented by focusing on 
small candidate regions and the use of pooling of genomic 
DNA (11,12). Borrowing the same idea, many targeted 
re-sequencing applications utilizing pooling have been 
seen in the past few years (13-16). 

The first-step analysis of NGS data for genetics study is 
often to identify genomic variants among sequenced 
samples. Quite a few SNP calling tools have been 
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implemented to identify SNPs from sequencing of indi- 
vidual genomes. SNP calling is a relatively straightforward 
problem in analysis of sequencing data of individual 
genomes, because the frequency of a candidate allele can 
be only 0 (non-variant), 0.5 (heterozygous) or 1 (alternate 
homozygous) for a diploid genome. Despite (high) 
sequencing error of NGS, a reliable call can be easily 
made given a high depth of coverage, say 20 x to 30 x. 
Consequently, statistical models for SNP calling have 
been developed and integrated as one simple functional 
module in many NGS short reads analysis tools such as 
SAMtools (17), MAQ (18), GATK (19) and VarScan (20). 
SAMtools and MAQ use a Bayesian statistical model to 
compute the posterior probabilities of the three possible 
genotypes. Specifically, for the likelihood part, they 
employ a binomial distribution to characterize sampling of 
the two haplotypes, and the prior probability, like other 
Bayesian approaches, is pre-specified. SAMtools and 
MAQ empirically set the prior probability of observing 
a heterozygote to be 0.001 for the discovery of new 
SNPs, and 0.2 for inferring genotypes at known SNP 
sites. A similar Bayesian algorithm is used by GATK 
followed by sophisticated filtering. Such Bayesian 
approaches may not be ideal for multiplicity control 
because of the subjectivity of assigning the prior probabil- 
ity. VarScan implements a heuristic/statistical method. 
For each candidate site, it applies several heuristic filters 
such as having a minimum number of supporting reads 
and allele frequency reaching a minimum threshold. It 
also conducts a Fisher's exact test for testing the deviation 
of the read counts supporting variant alleles from being 
generated because of sequencing error. Those heuristic 
filters overlap with the Fisher's exact test in terms of 
reducing false positives. When not systematically con- 
sidered, they may distort the statistics distribution under 
null and thus void the resultant P-values for multiplicity 
control. The variant call program we develop here is based 
on a frequentist approach, which will systematically 
consider all relevant factors and output P-values valid 
for multiplicity control. 

Identifying SNPs from pooled NGS data is more 
challenging in that pooled DNA is sampled from a 
number of individuals, which consequently will give rise 
to variant allele frequencies other than simply 0, 0.5 or 1. 
Driven by the need for analysis of increasing amount of 
pooled NGS data, several programs/methods for the de- 
tection of variants from the pooled data have been de- 
veloped. SNPSeeker employs the large deviation theory 
for SNP detection (21). It compares observed allele 
frequencies against the distribution of sequencing errors 
as measured by the Kullback Leibler (KL) distance (22). 
One limitation of this approach is that its error model has 
to be estimated from negative control data. SNPSeeker 
was recently extended to SPLINTER with two main im- 
provements (23). First, it is capable of detecting rare short 
indels. Second, it provides a good cutoff after ranking all 
candidate variants to balance power and type I error rate, 
which, however, requires an additional positive control 
data. CRISP (24) models the number of reads of the ref- 
erence and alternate alleles at a particular position across 
all pools as a contingency table, which is then tested by the 



Fisher's exact test. Its working hypothesis is that, due to 
rareness, presence of rare variants in all pools will be 
sporadic and then results in an excess of reads with the 
alternate allele as compared with the other pools, which is 
expected to be captured by the Fisher's exact test. CRISP 
then conducts a complementary test for the overabund- 
ance of alternate alleles within each pool against the 
sequencing error rate. Although it is shown that CRISP 
outperforms SNPSeeker, MAQ and VarScan (24), it has 
the following limitations. First, its working hypothesis 
does not hold well for common variants. When the 
MAF is large and/or the number of individuals in each 
pool is large, sporadic presence will disappear and result in 
no prominent excess of reads that can be captured by the 
Fisher's exact test. Second, their method is not applicable 
for single-pool data. Third, rareness and overabundance 
of alternate alleles are related but are captured separately 
using two different models, which may not be an efficient 
approach. In addition, these two separate tests make it 
hard to obtain an overall multiplicity control. Finally, 
its computational efficiency makes scalability an issue 
and may prevent its application in analysis of whole- 
exome or genome sequencing data. The main bottleneck 
comes from computing the P-value of a large number of 
contingency tables in the Fisher's exact test. 

In addition to the above direct SNP calling programs, 
there are also other relevant studies for analysis of pooled 
NGS data, including estimating allele frequencies from 
pooled sequencing (25), evaluating the ability to detect 
rare SNPs (15) and investigating the power of variant de- 
tection in pooled DNA for NGS and the optimal pooling 
designs (26), among others. In this article, we develop a 
statistical tool SNVer (single nucleotide variant caller/ 
seeker) for detecting variants in analysis of NGS data. 
SNVer is applicable to both pooled and individual data, 
and in particular it addresses the limitations that 
pre-existing methods have. 



MATERIAL AND METHODS 

Statistical models for single-pool data 

For a genomic locus, let 6 be its MAF in a population. If 9 
is larger than a threshold 0 O (6 > 0 O ), then we call it a single 
nucleotide polymorphism (SNP). Suppose that we sample 
N individuals (haploids) from this population for pooled 
sequencing. We assume that the number of individuals («) 
carrying the minor allele follows a binomial distribution 
b(N, 9), namely, 

n ~b{N,0) 

with 

Prob(«;<9) = f^V(l - Sf~" 

Now we re-sequence this genomic region. Suppose that 
K short reads cover this locus, if no sequencing error, 
given n individuals carrying the minor allele, the number 
of minor alleles X we observe from the K short sequence 
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reads follows also a binomial distribution b(K, n/N), 
namely: 

X ~ b(K,n/N) 

with 



Prob(X|«) 



K\fn 
X 



Now we assume sequencing error rate to be e, 
under which the minor allele will be flipped to one of 
the other three alternate alleles, and vice versa. So 
the observed X follows a binomial distribution 
b(K,(n/N)(l - s)+(l - n/JV)(e/3)), namely, 

V N y N 3 



with 



. n .. . N — m 

- - 1 - s)+ 

7V V ; JV 3 



JST-Jf 



Since « is not observable, we sum it out and obtain the 
statistical model for X as 

N 

ProbfJT; 6) = £ Prob(X|«)Prob(«; (9) 



n=0 



n=0 



D;)It,(i-«> 



N—m 



, n , N - nz 
^ (1 - £)+ ^V-3 



JV 3 



£)Q'('-£) 



Now we consider the hypothesis test of whether this 
locus is a (rare) variant (6 > 6 0 ) 

H 0 : 9 <6 0 versus Hi:6>6 0 

Its significance P-value will be 

P = Prob (X > x; 6 = 6 0 ) = 1 - Prob (X < x; (9 = 0 O ) 

Partial conjunction test for multiple-pool data 

The above statistical model is for testing a locus in one 
single-pool data. For M pools, we propose to test it in 
each pool separately. We therefore obtain a set of M 
hypotheses for each candidate variant. The problem of 
making a variant call at one specific locus involves the 
simultaneous testing of hypotheses at the set level. 
Typical questions considered in the multiple-testing frame- 
work include: (i) Are all M hypotheses in the set true? 
(ii) Are all M hypotheses in the set false? (iii) Are at 
least u out of M hypotheses in the set false? These ques- 
tions are referred to as conjunction test, disjunction test 



and partial conjunction test, respectively (27). Testing 
whether a locus is a variant based on multiple-pool data 
is equivalent to the partial conjunction test that at least 
u = 1 out of the M hypotheses for that locus is false. Let 
be the ordered /-values obtained from 



p m> p (2), 



(M) 



each single-pool test. Following (27), we employ the Simes 
method to calculate the pooled /-value for the partial 
conjunction test as 



p l/M — min 



M 

— j 
J 



(f)J 



1, 



M 



If the set of M null /"-values at the tested locus are 
independent, Benjamini and Heller (27) show that p l/M 
is a valid P-value for testing the partial conjunction null. 
The Benjamini Hochberg (BH) procedure (28) and other 
multiple-test adjustments can then be applied to the 
pooled Simes' P-values for multiplicity control when 
testing a large number of loci. It has been shown that 
this Simes-BH procedure controls the false discovery 
rate (FDR) at the pre-specified nominal level (27). 

Data sets 

Simulated data. We simulate synthetic data to investigate 
the numerical performances of our approach. For the 
single-pool scenario, a total of 10000 data sets are gene- 
rated under each combination of several conditions: 

• Sequencing coverage: low (lOx) and high (30x). 

• Sequencing error: low (0.01) and high (0.05) 

• MAF: rare variants with 0~U(O.OO1, 0.01), less 
common variants with #~U(0.01, 0.05) and very 
common variants #~U(0.05, 0.5) 

• The number of sequenced individuals from low to high 
with N = 10, 20, 50, 100, 200, 500, 1000, 1500, 2000 

For each MAF setting #~ U(6 m i n , # max ), we calculate 
the power of our approach for detecting variants by testing 
the null hypothesis H 0 : 6 < 0 m \ n . Meanwhile, to demon- 
strate that type I error is controlled at the nominal level 
by our proposed test, we simulate 9~U(0, 0 min ), and 
evaluate how likely the same null hypothesis H 0 : 6 < 9 m m 
will be rejected by mistake. For both power and type I 
error evaluations, we call a variant at the nominal level 
0.05. 

For the multiple-pool scenario, we follow the above 
single-pool simulation settings except that we simulate 
five pools with the same number of individuals in each 
pool and the total N = 10, 20, 50, 100, 200, 500, 1000, 
1500, 2000. 

Real data. We also assess the performance of our method 
in analysis of two pooled and one individual real NGS 
data sets as summarized in Table 1. The first one was an 
in-house Autism data set generated using ABI SOLiD 
platform from sequencing three genomic regions, denoted 
as Core, CDH9 and CDH10, of size 187, 158 and 158 kb, 
respectively, on chromosome 5 of the human genome. We 
made 24 pools with six individuals in each, totaling 144 
samples. We have 12 pools for Autism case samples and 
the other half 12 pools for control samples. One case pool 
experiment failed and we therefore have 23 pools in total 
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Table 1. Summary of T1D and Autism pooled sequencing and ADHD individual sequencing data sets 



Disease 


Platform 


Total 
reads 


Reads 
length 


#Pool 

Case 


Ctrl 


#Individual 
per pool 


Region 


Coverage per 
individual 


Autism 


SOLiD 


-402 M 


50 bp 


11 


12 


6 


-503 kb 


-90 x 


T1D 


454 


-9.4 M 


-250 bp 


10 


10 


48 


-31 kb 


~80x 


ADHD 


niumina 


-57 M 


76 bp x 2 


three individuals 




-38 Mb 


-20 x 



for analysis. We aligned short sequence reads by the 
Bioscope software from AB1 SOLiD with default param- 
eters. The mapped short sequence reads cover >96% of 
the three target regions with average 90 x depth of cover- 
age per individual. Meanwhile, we collected individual 
genotyping data for each sample, which were generated 
from Illumina HumanHap550v3 SNP arrays with ap- 
proximately 550 000 markers. With individual genotyping 
data, we may calculate the concordance of identified 
variants between pooled sequencing data and individual 
genotyping data for evaluating variant call quality. 

The second data set was collected in a recent study of 
causative Type 1 Diabetes (T1D) variants (14). Exons and 
splice sites of 10 candidate genes were re-sequenced by the 
454 sequencing system. Ten pooled samples each com- 
prising equal amounts of DNA from 48 T1D patients 
and 10 pooled samples each comprising equal amounts 
of DNA from 48 healthy controls were made, totaling 
480 T1D patients and 480 healthy controls from Great 
Britain. For each of the 20 pooled DNA samples, the 
numbers of produced short reads range from 28 1 270 to 
579 102, with average length of 250 bases and 9 416 365 
reads in total. We mapped these reads by BWA-SW (29) 
with default parameters and the average depth of coverage 
is 80 x per individual. 

The third one was an in-house individual sequencing 
data set. We performed paired end exome sequencing on 
three members affected with attention deficit/hyperactivity 
disorder (ADHD) in a pedigree, using the Illumina 
Genome Analyzer IIx platform with read lengths of 
76 bp. It targets all human exonic regions totaling 
~38 Mb. We aligned the short reads by BWA with default 
parameters and removed duplicates by picard (http:// 
sourceforge.net/projects/picard/). These mapped and 
cleaned short reads were then re-aligned locally by the 
GATK IndelRealigner tool (30). The average depth of 
coverage is ~20x for each patient. Meanwhile, we also 
collected the genotyping data of these three patients, 
generated from the Illumina Human610-Quad version 1 
SNP arrays with -610000 markers (including ~20000 
non-polymorphic markers). 

For pooled sequencing data, CRISP has been shown to 
outperform other existing methods (24), so we focus on 
the comparison of our program with CRISP in perform- 
ance evaluation. We also include SAMtools for compari- 
son although it is not designed for pooled sequencing 
data. For the ADHD individual data, we compare 
SNVer with SAMtools and GATK. Variant positions were 
called and filtered by SAMtools with all default settings 



plus using awk '(S3 = = "*" &$6> = 50) || ($3! ='*' 
&$6> = 20)', as suggested by the SAMtools website. 
For the ADHD data, SAMtools with the suggested setting 
returned so many variants that we also report SAMtools 
results with an additional filtering — d20 to remove variant 
calls with sequencing coverage less than 20, for getting 
comparable numbers of variant calls as SNVer. We also 
called variants using the GATK UnifiedGenotyper, 
followed by further filtering based on the latest recommen- 
dations from the authors of GATK (see Supplementary 
Data for the detailed settings). SNVer utilizes SAMtools 
(17) to process and pile up mapped short reads. CRISP 
has its own pileup procedure integrated in its analysis 
pipeline. To make a fair comparison, following CRISP 
(24), we perform similar quality control and set the same 
processing parameters such as mapping quality and base 
quality filtering thresholds. 

RESULTS 

Power and type I error evaluations 

The single-pool results are shown in Figure 1 . We can see 
that our method can control type I error rate at the 
nominal level 0.05 in all settings. The number of sampled 
individuals (sample size) and the depth of coverage are 
both shown to be helpful in improving power. The 
largest improvement of ~10% attributed to depth of 
coverage (from lOx to 30 x) is observed in the rare 
variants and high sequencing error (up-right panel). The 
improvement contributed by larger sample size keeps 
increasing at a decreasing rate until saturated. These 
power improvement curves would be helpful for pooling 
experiment design and provide guidance as to how to bal- 
ance sample size (cost) and desired power. As expected, 
rare variants are much harder to be detected than 
common variants. A large sample size is required for 
achieving high power to detect them. Finally, higher 
sequencing error (0.05 versus 0.01) puts a small dent to 
power. 

Figure 2 shows similar results for the multiple-pool 
scenario. Again, type I error rate is controlled at the 
nominal level 0.05. We also observe that given the same 
number of sequenced individuals, single-pool design yields 
a bit higher power with lower type I error rate in compari- 
son with multiple-pool design, for example, 1000 individ- 
uals using one single pool versus five pools with 200 
individuals in each. CRISP selects candidate SNPs by 
the Fisher's exact test, which is then followed by addition- 
al filtering steps. In the multiple-pool scenario, we show 
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Figure 1. Power (PW) and Type I error rate (Err) of SNVer using single-pool data at low (lOx) and high (30x) coverage. 
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Figure 2. Power (PW) and Type I error rate (Err) of SNVer using multiple-pool data at low (lOx) and high (30x) coverage. 



2000 



that the rankings of candidates SNPs by our test is 
superior to those by the Fisher's exact test employed by 
CRISP. To compare the efficiencies of these two rankings, 
we divide the 10000 positives with #~ U(9 m i n , 0 max ) and 
10000 negatives with 0~U(O, # min ) into 100 groups, each 
with 100 positives and 100 negatives. These 200 loci are 



then ranked by their significance levels of testing the null 
H 0 : 6 < f? min using our statistical models. Rankings based 
the Fisher's exact test are also generated. The area under 
the curve (AUC) score averaged over 100 groups is used to 
evaluate these two rankings as shown in Figure 3 for the 
typical scenario of 30 x coverage and 0.05 sequencing 
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Figure 3. Ranking efficiency of the binomial models employed by SNVer versus the Fisher's exact test employed by CRISP. 



error. We can see that the Fisher's exact test is very inef- 
ficient for detecting common and less common variants. 
CRISP therefore has to rely on additional sequencing 
error models to complement the Fisher's exact test for 
detecting common variants. We apply the BH procedure 
to control FDR at the nominal level of 0.1 and 0.05. As 
shown in Supplementary Table SI, the FDR for the 
Fisher's exact test is inflated, particularly dramatically 
for common and less common variants; SNVer 
controls the FDR very well. The number of sequenced 
individuals is modeled in our test and is shown to be 
helpful. This information is not explicitly utilized by 
CRISP in its Fisher's exact test and therefore contrib- 
utes very little for detecting common and less com- 
mon variants, although CRISP models it at the later 
filtering step. 



The accuracy of allele frequency estimation has an 
impact on variant call, and is more critical for estab- 
lishing association in genetics studies. Therefore we 
also plot the estimated MAF against the actual MAF 
when e = 0.01 in Figure 4. For a moderate sample size 
of 250, we observe good concordance with correlation 
coefficients r 2 = 0.9828 and r 2 = 0.9318 for the single-pool 
design and the multiple-pool design, respectively. When 
the sample size increases to 1000, the concord- 
ance improves to r 2 = 0.9955 and r 2 = 0.9769 for the 
single- and the multiple-pool design, respectively. The 
lower concordance of the multiple design may be 
attributed to its additional between-pool variance. It 
also explains why singe-pool design yields fewer false posi- 
tives than the multiple-pool design for the same set of 
samples. 
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Figure 4. Correlation between the minor allele frequencies and its estimates in pooled sequencing. 
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Real data application 

Better performance. The user of SNVer only needs to set 
the sequencing error rate s and the variant threshold 0 O . 
SNVer will then report the significance ^-values of the 



tested loci of how likely their MAF 6 < 9 0 We assume 
£ = 0.01 for all real data sets. CRISP calls both rare and 
common variants, so we set 0 0 = 0 for SNVer to compare 
their performance in calling variants. CRISP will output 
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the variants it calls, while SNVer will report overall sig- 
nificance P-values for each locus, based on which the user 
can choose a threshold he/she feels appropriate and make 
variant calls. To make a comparison, we rank loci by their 
P-values output by SNVer and take the significance 
threshold that gives the same number of variants called 
by CRISP. The loci identified as variants by these two 
programs are then annotated by SeattleSeq (http://gvs.gs 
.washington.edu/SeattleSeqAnnotation/), and we count how 
many of them have been confirmed as variants in dbSNP. 
Following (30), we evaluate variant call quality by 
examining dbSNP rate, transition/transversion (Ti/Tv) 
ratio and concordance of sequencing and individual 
genotyping calls. A higher Ti/Tv ratio generally indicates 
a higher accuracy; this metrics is particularly helpful for 
assessing novel single nucleotide variant calls (30). The 
variant call results are summarized in Table 2. For the 
Autism and T1D pooled sequencing data sets, SNVer 
has the higher dbSNP rates, the higher overall Ti/Tv 
ratios and the higher Ti/Tv ratios for new sites, in com- 
parison with CRISP. It indicates the better quality of the 



call sets SNVer produced. In contrast, SAMtools made 
much fewer SNP calls which led to much lower 
sensitivities, despite its higher Ti/Tv ratios. Out of the 
110 SNPs that have been genotyped by SNP arrays in 
the Autism data set, SAMtools identified only 16 SNPs 
with 100% genotyping concordance, while both SNVer 
and CRISP called about 100 SNPs with 100% genotyping 
concordance. This confirms that SAMtools may not be 
appropriate for pooled sequencing data. The correlation 
between alternate allele frequencies in individually geno- 
typed DNA samples and frequency estimates in the se- 
quenced DNA pools is plotted in Figure 5, with 
r = 0.92 and r 2 = 0.94 for the Autism case and control, 
respectively. The achieved 100% genotype concordance 
with less perfect frequency estimates is not surprising 
because accurate estimate of allele frequency 6 is only 
critical for rare variants when testing 6 > 0. 

As shown in Table 2, for the ADHD individual 
sequencing data, under family-wise error rate 0.05 level, 
SNVer also obtained the variant call sets with good 
quality. This is evidenced by the ~97% dbSNP rates, 



Table 2. Comparison of SNP calling by CRISP, SAMtools, GATK and SNVer 

Data No. of SNP Ti/Tv a Concordance b 

All Known Novel dbSNP% All Known Novel TP/P (%) 



Autism (pooled) 
Case 

CRISP 

SNVer 

SAMtools 
Control 

CRISP 

SNVer 

SAMtools 
T1D (pooled) 
Case 

CRISP 

SNVer 

SAMtools 
Control 

CRISP 

SNVer 

SAMtools 
ADHD (Individual) 
84060 

SNVer 

SAMtools 

SAMtools 20 * 

GATK 
84615 

SNVer 

SAMtools 

SAMtools 20x 

GATK 
92157 

SNVer 

SAMtools 

SAMtools 20 * 

GATK 



2182 
2182 
261 

2063 
2063 
239 



306 
306 
14 

167 
167 
18 



18001 

48 988 
15038 
19 655 

17436 
46 037 
15510 
18 892 

18 676 

49 729 
15881 
20100 



1791 
1795 
260 

1610 
1617 
238 



93 
126 
9 

110 
120 
12 



17 535 
47 513 
14538 
19713 

16914 
44489 
14942 
18419 

18 208 
47 693 
15 370 
19631 



391 
387 
1 

453 
446 
1 



213 
180 

5 

57 
47 
6 



466 
1475 
500 
482 

522 
1548 
568 
473 

468 
2036 
511 
469 



82.1 
82.3 
99.6 

78.0 
78.4 
99.6 



30.3 
41.2 
64.3 

65.9 
71.9 
66.7 



97.4 
97.0 
96.7 
97.6 

97.0 
96.6 
96.3 
97.5 

97.5 
95.9 
96.8 
97.7 



1.68 
1.71 
2.26 

1.68 
1.78 
2.06 



0.95 
1.71 

10/4 

1.49 
2.34 
14/4 



2.89 
2.66 
2.70 
2.91 

2.85 
2.64 
2.74 
2.89 

2.90 
2.69 
2.80 
2.98 



1.79 
1.81 
2.29 

1.83 
1.89 
2.05 



2.58 
2.15 

8/1 

2.93 
3.00 
11/1 



2.89 
2.68 
2.72 
2.94 

2.87 
2.67 
2.77 
2.92 

2.92 
2.73 
2.83 
3.00 



1.26 
1.35 

0/1 

1.27 
1.45 

1/0 



0.63 
1.47 

2/3 

0.46 
1.35 

3/3 



2.73 
2.16 
2.11 
2.15 

2.22 
1.94 
2.02 
2.03 

2.37 
2.03 
1.99 
2.35 



101/101 (100) 
102/102 (100) 
16/16 (100) 

96/96 (100) 
95/95 (100) 
16/16 (100) 



N/A 



4158/4183 (99.4) 
4437/8116 (54.7) 
2034/3158 (64.4) 
4649/4657 (99.8) 

4032/4063 (99.2) 
4173/7643 (54.4) 
2062/3247 (63.5) 
4537/4566 (99.4) 

4192/4224 (99.2) 
4251/7996 (53.2) 
2028/3259 (62.2) 
4700/4710 (99.8) 



"Transition and transversion ratio for the identified variants. When the number of variants is small we just report the numbers but not calculate the 
ratio, e.g. 10/4 for all variants in T1D case by SAMtools means 10 transitions and 4 transversions. 

b Genotype concordance. P represents the number of variants called by each program and also genotyped. TP represents the number of variant calls 
concordant between sequencing data and individual genotyping data. 
20x: Additional filtering of sequencing depth >20 is applied. 
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Figure 5. Correlation between alternate allele frequencies in individually genotyped DNA samples and its estimates in the sequenced DNA pools for 
the Autism data set. Different symbols represent different depth of coverage ranges as shown in the legend. 



the approximately 2.9 overall Ti/Tv ratios, the 2.22-2.73 
Ti/Tv ratios for novel sites, and the 99% genotype con- 
cordance. SAMtools with suggested parameters/filters 
made 2+ times more variant calls than SNVer (e.g. 
~49K versus ~18 K). The lower Ti/Tv ratios and 
genotype concordance suggest poorer quality for these 
larger call sets made by SAMtools. When applied with 
an additional filtering of sequencing depth >20x, 
SAMtools identified fewer SNPs than SNVer. But it still 
has lower quality as indicated by the lower Ti/Tv ratios 
and genotype concordance. Compared with GATK, 
SNVer has similar performance, while with the higher 
Ti/Tv ratios for novel variants in all three individuals. 

We note that the Ti/Tv ratios for novel variants in the 
pooled sequencing data are low for both programs. It 
suggests that they may not perform well for novel variants 
if we estimate the false-positive rates based on the Ti/Tv 
ratios following (30). It confirms that variant calling is 
more challenging for pooled sequencing. Meanwhile, 
estimating false-positive rates using this summary statistic 
should be cautious for pooled sequencing. First, Ti/Tv 
estimate for pooled samples is not as accurate as for indi- 
vidual samples. Second, targeted resequencing regions are 
usually small, e.g. 31 kb for the T1D data and 503 kb for 
the Autism data, and therefore may exhibit higher genomic 
and statistical variances. For example, the ADHD indi- 
vidual 840 60 has an exome-wide Ti/Tv ratio of 2.89 for all 
variants; if we calculate Ti/Tv ratios based on only 500-kb 
regions, then the smallest Ti/Tv ratio we obtain is 1.31, 
and the largest 7.00 with SD = 1.53 (we consider only 
500-kb regions with at least 30 variants for having stable 
Ti/Tv ratio estimates). 

Better scalability. SNVer and SAMtools exhibit similar 
efficiency in terms of running time. The running time of 
SNVer and CRISP in analysis of the T1D and Autism 
data sets is given in Figure 6. The main bottleneck of 
CRISP comes from computing the P-value of a large 
number of contingency tables in the Fisher's exact test. 
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Figure 6. (a and b) Comparison of running time of SNVer and CRISP 
for testing testing (a) the T1D 31 kb region and (b) the Autism 503 kb 
region. Running time of SNVer is mainly determined by the region size 
(the number of tests), while larger pool numbers and sequencing depth 
will take additional time for CRISP. 



Therefore, in additional to the number of tests, its time 
efficiency is also largely dependent on the number of pools 
and the depth of coverage. In contrast, these two factors 
have little impact on SNVer and its running time is 
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roughly linear with the region size (the number of tests). 
For example, SNVer spends 0.1 h on 31 kb and 1.5 h on 
503 kb for the two data sets, respectively. SNVer is much 
faster than CRISP. Taking the T1D case for example, 
SNVer is ~500-fold faster than CRISP and achieves 
300kb/h. Such efficiency makes feasible the application 
of SNVer to analysis of whole-exome sequencing data, 
or even whole-genome sequencing data using high per- 
formance computing cluster, both of which, however, 
will take prohibitively longer time for CRISP. 

Informative ranking and multiplicity control 

SNVer reports one single overall significance P-value for 
each locus, based on which the rankings of all tested loci 
can be produced. Such rankings are more informative and 
accurate than the dichotomous decision of whether to 
'accept or reject the candidate as a variant' provided by 
CRISP and most other existing methods. For example, 
four rare variants have been found to be associated with 
T1D based on the T1D data set by comparing the 
estimated MAF in cases and controls (14). We use 
SNVer to call these four variants by testing the null hy- 
pothesis 9<9 0 = 0.01. We give the rankings of them by 
SNVer in Table 3, as well as the dichotomous decisions 
made by CRISP. For SNVer, we observe very significant 
ranking changes of these four SNPs, which are consistent 
with their MAFs (relative to the threshold 0.01) and the 
MAF differences. CRISP identifies three of them, 
rs35337543, ssl07794688 and ssl07794687, as variants in 
both cases and controls, exhibiting no informative differ- 
ential changes. It should be noted that the ranking differ- 
ence may only reflect frequency difference. Large 
frequency difference between case and control of those 
variants may suggest their potential association with the 
phenotype, but their functional importance to the pheno- 
type is yet to be assessed by further experiments. 

In addition to ranking, valid P-values given by SNVer 
also make multiplicity control possible. Tens of thousands 
or millions loci are usually simultaneously examined in 
typical NGS experiments. It is particularly desirable to 
have multiplicity control, which gives the user an idea of 
the chance of making any errors and/or the proportion of 
false positives among the variant calls they make. Each 
user can choose the type I error rate threshold he or she 
considers appropriate, instead of just the dichotomous de- 
cisions of whether to 'accept or reject the candidates' 
provided by most existing methods. 



Table 3. Informative rankings of four rare variants with the null 
hypothesis 0<6 O = 0.01 



SNP 




T1D case 




T1D control 






Estimated SNVer 


CRISP 


Estimated SNVer 


CRISP 




MAF (%) ranking 


CALL 


MAF (%) ranking 


CALL 


rs35337543 


0.36 


17 557 


Y 


2.51 45 


Y 


rs35667974 


0.72 


17 557 


N 


2.42 59 


Y 


ssl07794688 


0.50 


17 557 


Y 


1.79 56 


Y 


ssl07794687 


1.07 


145 


Y 


2.45 51 


Y 



DISCUSSION 

We have developed a novel statistical tool SNVer for 
calling SNPs in analysis of pooled or individual NGS 
data. Different from the previous models employed by 
CRISP, it analyzes common and rare variants in one 
integrated model, which considers and models all 
relevant factors including variant distribution and 
sequencing errors simultaneously. As a result, the user 
does not need to specify several filter cutoffs as required 
by CRISP. Some variant calling methods simply discard 
loci with low depth of coverage to achieve reliable variant 
calls. Our statistical model does not discriminate against 
poorly covered loci. Loci with any (low) coverage can be 
tested and depth of coverage will be quantitatively 
factored into the final significance calculation. SNVer 
reports one single overall significance P-value for 
evaluating the significance of a candidate being a 
variant. An advantage of reporting results on a more con- 
tinuous scale, instead of just the dichotomous decision of 
whether to 'accept or reject the candidate as a variant' as 
most existing methods do, is that the user can choose the 
alpha threshold he or she considers appropriate. We have 
used both simulated data and real data to demonstrate the 
superior performance of our program in comparison with 
pre-existing methods. Although SNVer is motivated by 
the need for analysis of pooled NGS data, it can also be 
applied to individual NGS data as a special case (N = 2 
for diploid species), as shown in the ADHD data set. 

Sampling bias is a non-trivial problem in pooled se- 
quencing, and in particular, rare variants are prone to 
sampling issues. Properly considering it may further im- 
prove the power. In this article, to make inference of the 
MAF 6 of each site, we model the number of observed 
alleles conditional on the coverage from a frequentist 
standpoint. The power of detecting variants may be 
further improved if sampling bias is modeled properly so 
that we have more informative inference of the coverage 
rather than conditional on it. Since we have only one ob- 
servation for each site, to model sampling bias or make 
any site-specific inference, e.g. base quality/error, we have 
to pool information across sites. Bayesian models may be 
a better, if not the only, way to this end. For example, the 
distribution of coverage of all sites can be approximated 
by the Gamma distribution for Illumina's short read align- 
ments (31). Shen and colleagues (32) propose to estimate 
the posterior error rates for each substitution through a 
Bayesian formula, in which error models are learned from 
training data sets. Our frequentist approach does not 
model sampling bias; however, it has its own merits. 
First, the sampling bias issue may be very application 
specific. Different target enrichment kits may have differ- 
ent coverage uniformities. More variant sampling bias is 
expected for targeted re-sequencing, the current main 
pooling application, due to region-specific GC content. 
Mapping algorithms will also critically impact coverage. 
As a result, any approaches with sampling bias modeled 
may have to check carefully whether the sampling bias 
model/distribution fits well for every application. 
Second, our frequentist approach does not pool informa- 
tion across sites, which consequently has minimal 
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requirement for input and wider applications. For 
example, when only one or few sites are tested, and 
without any help from external training data, sampling 
bias could not be modeled (well), but our frequentist 
approach still can be applied. 

So, sampling bias is not considered in our frequentist 
approach, which consequently makes few assumptions, 
requires minimal input, and thus has wider applications. 
On the other hand, sampling issues may be addressed 
by more careful pooled re-sequencing designs (33). 
Companies such as NimbleGen and Agilent are also com- 
peting to improve their target enrichment kits to obtain 
coverage uniformity. With these upstream efforts, sam- 
pling bias may have a minimized impact on downstream 
variant call algorithms. 

Our current program can be improved and extended in 
several ways. First, small indels are not supported. Indels 
impose a great challenge for NGS including DNA ampli- 
fication and reads mapping which are under fast develop- 
ment. When those techniques become mature in handling 
indels, we may investigate their distribution and work out 
a proper calling strategy. Second, sequencing quality 
scores can be utilized to estimate site-specific sequencing 
error. Third, the majority loci of sequenced segments are 
known to carry no variants. The density of SNP is esti- 
mated to be around 1 out of 1000 bases. Such prior per- 
centage of non-nulls information may help obtain more 
precise multiplicity control. Fourth, the dependency 
among tests will also be informative in increasing testing 
efficiency. We have shown that the LD dependency infor- 
mation is very informative in increasing the efficiency of 
conducting genome-wide association tests in analysis of 
GWAS data (34). We also found recently that dependency 
information is helpful for increasing the efficiency of 
testing hypotheses at the set level (35). For NGS data, 
one non-null (variant) is expected from every 1000 con- 
secutive genomic bases. Such dependency patterns, if ap- 
propriately modeled, may help further improve testing 
efficiency. Lastly, our current program focuses on calling 
variants, namely, testing whether 9 is larger than a thresh- 
old. Under the same framework, our models can be nat- 
urally extended for case-control association studies by 
testing whether # case = 0 con troi- We are currently working 
on these extensions. 

In summary, we have developed a statistical tool SNVer 
for calling common and rare variants in analysis of both 
pooled and individual NGS data. As more and more NGS 
data become available, we expect more applications of our 
program. 
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